Abstract

Neuroscientists have long criticised deep learning algorithms as
incompatible with current knowledge of neurobiology. We explore more
biologically plausible versions of deep representation learning,
focusing here mostly on unsupervised learning but developing a learning
mechanism that could account for supervised, unsupervised and
reinforcement learning. The starting point is that the basic learning
rule believed to govern synaptic weight updates (Spike-Timing-Dependent
Plasticity) arises out of a simple update rule that makes a lot of sense
from a machine learning point of view and can be interpreted as gradient
descent on some objective function so long as the neuronal dynamics push
firing rates towards better values of the objective function (be it
supervised, unsupervised, or reward-driven). The second main idea is
that this corresponds to a form of the variational EM algorithm, i.e.,
with approximate rather than exact posteriors, implemented by neural
dynamics. Another contribution of this paper is that the gradients
required for updating the hidden states in the above variational
interpretation can be estimated using an approximation that only
requires propagating activations forward and backward, with pairs of
layers learning to form a denoising auto-encoder. Finally, we extend the
theory about the probabilistic interpretation of auto-encoders to
justify improved sampling schemes based on the generative interpretation
of denoising auto-encoders, and we validate all these ideas on
generative learning tasks.

1. Introduction 

Deep learning and artificial neural networks have taken their
inspiration from brains, but mostly for the form of the computation
performed (with much of the biology, such as the presence of spikes,
remaining to be accounted for). However, what is lacking currently is a
credible machine learning interpretation of the learning rules that seem
to exist in biological neurons that would explain efficient joint
training of a deep neural network, i.e., accounting for credit
assignment through a long chain of neural connections. Solving the
credit assignment problem therefore means identifying neurons and
weights that are responsible for a desired outcome and changing
parameters accordingly. Whereas back-propagation offers a machine
learning answer, it is not biologically plausible, as discussed in the
next paragraph. Finding a biologically plausible machine learning
approach for credit assignment in deep networks is the main long-term
question to which this paper contributes.


Let us first consider the claim that state-of-the-art deep learning
algorithms rely on mechanisms that seem biologically implausible, such
as gradient back-propagation, i.e., the mechanism for computing the
gradient of an objective function with respect to neural activations and
parameters. The following difficulties can be raised regarding the
biological plausibility of back-propagation: (1) the back-propagation
computation (coming down from the output layer to lower hidden layers)
is purely linear, whereas biological neurons interleave linear and
non-linear operations, (2) if the feedback paths known to exist in the
brain (with their own synapses and maybe their own neurons) were used to
propagate credit assignment by backprop, they would need precise
knowledge of the derivatives of the non-linearities at the operating
point used in the corresponding feedforward computation on the
feedforward path^1, (3) similarly, these feedback paths would have to
use exact symmetric weights (with the same connectivity, transposed) of
the feedforward connections^2, (4) real neurons communicate by (possibly
stochastic) binary values (spikes), not by clean continuous values,
(5) the computation would have to be precisely clocked to alternate
between feedforward and back-propagation phases (since the latter needs
the former’s results), and (6) it is not clear where the output targets
would come from. The approach proposed in this paper has the ambition to
address all these issues, although some question marks as to a possible
biological implementation remain, and of course many more detailed
elements of the biology that need to be accounted for are not covered
here.

^1 and with neurons not all being exactly the same, it could be
difficult to match the right estimated derivatives

^2 this is known as the weight transport problem (... 2014)


Note that back-propagation is used not just for classical supervised
learning but also for many unsupervised learning algorithms, including
all kinds of auto-encoders: sparse auto-encoders (Ranzato et al., 2007;
Goodfellow et al., 2009), denoising auto-encoders (Vincent et al.,
2008), contractive auto-encoders (Rifai et al., 2011), and more
recently, variational auto-encoders (Kingma and Welling, 2014). Other
unsupervised learning algorithms exist which do not rely on
back-propagation, such as the various Boltzmann machine learning
algorithms (Hinton and Sejnowski, 1986; Smolensky, 1986; Hinton et al.,
2006; Salakhutdinov and Hinton, 2009). Boltzmann machines are probably
the most biologically plausible learning algorithms for deep
architectures that we currently know, but they also face several
question marks in this regard, such as the weight transport problem
((3) above) to achieve symmetric weights, and the positive-phase vs
negative-phase synchronization question (similar to (5) above).


Our starting point (Sec. 2) proposes an interpretation of the main
learning rule observed in biological synapses: Spike-Timing-Dependent
Plasticity (STDP). Inspired by earlier ideas (Xie and Seung, 2000;
Hinton, 2007), we first show via both an intuitive argument and a
simulation that STDP could be seen as stochastic gradient descent if
only the neuron was driven by a feedback signal that either increases or
decreases the neuron’s firing rate in proportion to the gradient of an
objective function with respect to the neuron’s voltage potential.


In Sec. 3 we present the first machine learning interpretation of STDP
that gives rise to efficient credit assignment through multiple layers.
We first argue that the above interpretation of STDP suggests that
neural dynamics (which creates the above changes in neuronal activations
thanks to feedback and lateral connections) correspond to inference
towards neural configurations that are more consistent with each other
and with the observations (inputs, targets, or rewards). This view is
analogous to the interpretation of inference in Boltzmann machines while
avoiding the need to obtain representative samples from the stationary
distribution of an MCMC. Going beyond Hinton’s proposal, it naturally
suggests that the training procedure corresponds to a form of
variational EM (Neal and Hinton, 1999) (see Sec. 3), possibly based on
MAP (maximum a posteriori) or MCMC (Markov Chain Monte-Carlo)
approximations. In Sec. 4 we show how this mathematical framework
suggests a training procedure for a deep directed generative network
with many layers of latent variables. However, the above interpretation
would still require computing some gradients. Another contribution
(Sec. 6) is to show that one can estimate these gradients via an
approximation that only involves ordinary neural computation and no
explicit derivatives, following previous (unpublished) work on target
propagation (Bengio, 2014; Lee et al., 2014). We introduce a novel
justification for difference target propagation (Lee et al., 2014),
exploiting the fact that the proposed learning mechanism can be
interpreted as training a denoising auto-encoder. As discussed in
Sec. 5, these alternative interpretations of the model provide different
ways to sample from it, and we found that better samples could be
obtained.


2. STDP as Stochastic Gradient Descent 

Spike-Timing-Dependent Plasticity or STDP is believed to be the main
form of synaptic change in neurons (Markram and Sakmann, 1995; Gerstner
et al., 1996) and it relates the expected change in synaptic weights to
the timing difference between post-synaptic spikes and pre-synaptic
spikes. Although it is the result of experimental observations in
biological neurons, its interpretation as part of a learning procedure
that could explain learning in deep networks remains unclear. Xie and
Seung (2000) nicely showed how STDP can correspond to a differential
anti-Hebbian plasticity, i.e., the synaptic change is proportional to
the product of pre-synaptic activity and the temporal derivative of the
post-synaptic activity. The question is how this could make sense from a
machine learning point of view. This paper aims at proposing such an
interpretation, starting from the general idea introduced by Hinton
(2007), anchoring it in a novel machine learning interpretation, and
extending it to deep unsupervised generative modeling of the data.


What has been observed in STDP is that the weights change if there is a
pre-synaptic spike in the temporal vicinity of a post-synaptic spike:
that change is positive if the post-synaptic spike happens just after
the pre-synaptic spike, negative if it happens just before. Furthermore,
the amount of change decays to zero as the temporal difference between
the two spikes increases in magnitude. We are thus interested in this
temporal window around a pre-synaptic spike during which we could have a
post-synaptic spike, before or after the pre-synaptic spike.


We propose a novel explanation for the STDP curve as a consequence of an
actual update equation which makes a lot of sense from a machine
learning perspective:

    \Delta W_{ij} \propto S_i \dot{V}_j,    (1)

where $\dot{V}_j$ indicates the temporal derivative of $V_j$, $S_i$
indicates the pre-synaptic spike (from neuron i), and $V_j$ indicates
the post-synaptic voltage potential (of neuron j).


To see how the above update rule can give rise to STDP, consider the
average effect of the rest of the inputs into neuron j, which induces an
average temporal change $\dot{V}_j$ of the post-synaptic potential
(which we assume approximately constant throughout the duration of the
window around the pre-synaptic firing event), and assume that at time 0
when $S_i$ spikes, $V_j$ is below the firing threshold. Let us call
$\Delta T$ the temporal difference between the post-synaptic spike and
the pre-synaptic spike.

Figure 1. Result of simulation around pre-synaptic spike (time 0)
showing indirectly the effect of a change in the rate of change in the
post-synaptic voltage, $\dot{V}_j$, on both the average time difference
between pre- and post-synaptic spikes (horizontal axis, $\Delta T$) and
the average weight change (vertical axis, $\Delta W_{ij}$), when the
latter follows Eq. 1. This corresponds very well to the observed
relationship between $\Delta T$ and $\Delta W_{ij}$ in the biological
literature.

First, let us consider the case where $\dot{V}_j > 0$, i.e., the voltage
potential increases. Depending on the magnitude of $\dot{V}_j$, it will
take more or less time for $V_j$ to reach the firing threshold, i.e.
more time for smaller $\dot{V}_j$. Hence a longer $\Delta T$ corresponds
to a smaller $\Delta W_{ij}$, and a positive $\Delta T$ to a positive
$\Delta W_{ij}$, as observed for STDP.


Second, let us consider the case where $\dot{V}_j < 0$, i.e., the
voltage potential has been decreasing (remember that we are considering
the average effect of the rest of the inputs into neuron j, and assuming
that the temporal slope of that effect is locally constant). Thus, it is
likely that earlier before the pre-synaptic spike the post-synaptic
voltage $V_j$ had been high enough to be above the firing threshold. How
far back in the past again depends monotonically on $\dot{V}_j$. Hence a
negative $\Delta T$ corresponds to a negative $\Delta W_{ij}$ and a more
negative $\Delta T$ corresponds to a $\Delta W_{ij}$ that is smaller in
magnitude. This corresponds perfectly to the kind of relationship that
is observed by biologists with STDP (see Figure 7 of Bi and Poo (1998)
or Figure 1 of Sjostrom and Gerstner (2010), e.g. at
http://www.scholarpedia.org/article/Spike-timing_dependent_plasticity).
In a simulation inspired by the above analysis, we observe essentially
the same curve relating the average $\Delta T$ and the $\Delta W_{ij}$
that is associated with it, as illustrated in Figure 1.


Clearly, the consequence of Eq. 1 is that if the change $\Delta V_j$
corresponds to improving some objective function $J$, then STDP
corresponds to approximate stochastic gradient descent in that objective
function. With this view, STDP would implement the delta rule (gradient
descent on a one-layer network) if the post-synaptic activation changes
in the direction of the gradient.
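
The qualitative argument of this section can be illustrated with a small deterministic caricature. This is only a sketch under stated assumptions, not the simulation behind Figure 1: the post-synaptic voltage is assumed to move at a constant slope, the pre-synaptic spike is fixed at time 0 with $S_i = 1$, and the function name `stdp_pairs`, the starting voltage `v0` and the `threshold` value are all hypothetical choices made for illustration.

```python
import numpy as np

def stdp_pairs(v_dots, v0=0.0, threshold=1.0):
    """Return (delta_t, delta_w) for each constant voltage slope in v_dots.

    Assumptions (illustrative only): the voltage is v0 < threshold at time 0,
    when the pre-synaptic spike occurs, and changes linearly with slope v_dot.
    """
    v_dots = np.asarray(v_dots, dtype=float)
    # Time at which the linear voltage trace crosses the threshold.
    # A negative value means the crossing happened before the pre-synaptic spike.
    delta_t = (threshold - v0) / v_dots
    # Eq. 1 with S_i = 1: the weight change is proportional to the slope.
    delta_w = v_dots.copy()
    return delta_t, delta_w

slopes = np.array([-4.0, -2.0, -1.0, -0.5, 0.5, 1.0, 2.0, 4.0])
delta_t, delta_w = stdp_pairs(slopes)
for dt, dw in zip(delta_t, delta_w):
    print(f"delta_T = {dt:+.2f}   delta_W = {dw:+.2f}")
```

In this caricature the sign of the weight change always matches the sign of $\Delta T$, and its magnitude shrinks as $|\Delta T|$ grows, which is the shape described in the text.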


3. Variational EM with Learned Approximate 
Inference 

To take advantage of the above statement, the dynamics of the neural
network must be such that neural activities move towards better values
of some objective function $J$. Hence we would like to define such an
objective function in a way that is consistent with the actual neural
computation being performed (for fixed weights $W$), in the sense that
the expected temporal change of the voltage potentials approximately
corresponds to increases in $J$. In this paper, we are going to consider
the voltage potentials as the central variables of interest which
influence $J$ and consider them as latent variables $V$ (denoted $h$
below to keep the machine learning interpretation general), while we
will consider the actual spike trains $S$ as non-linear noisy
corruptions of $V$, a form of quantization (with the “noise level”
controlled either by the integration time or the number of redundant
neurons in an ensemble (Legenstein and Maass, 2014)). This view makes
the application of the denoising auto-encoder theorems discussed in
Sec. 5 more straightforward.

The main contribution of this paper is to propose and give support to
the hypothesis that $J$ comes out of a variational bound on the
likelihood of the data. Variational bounds have been proposed to justify
various learning algorithms for generative models (Hinton et al., 1995)
(Sec. 7). To keep the mapping to biology open, consider such bounds and
the associated criteria that may be derived from them, using an abstract
notation with observed variable $x$ and latent variable $h$. If we have
a model $p(x, h)$ of their joint distribution, as well as some
approximate inference mechanism defining a conditional distribution
$q^*(H|x)$, the observed data log-likelihood $\log p(x)$ can be
decomposed as


    \log p(x) = \log p(x) \sum_h q^*(h|x)
              = \sum_h q^*(h|x) \log \frac{p(x,h) q^*(h|x)}{p(h|x) q^*(h|x)}
              = E_{q^*(H|x)}[\log p(x,H)] + H[q^*(H|x)]
                + KL(q^*(H|x) || p(H|x)),    (2)


where $H[\cdot]$ denotes entropy and $KL(\cdot||\cdot)$ the
Kullback-Leibler (KL) divergence, and where we have used sums but
integrals should be considered when the variables are continuous. Since
both the entropy and the KL-divergence are non-negative, we can either
bound the log-likelihood via

    \log p(x) \geq E_{q^*(H|x)}[\log p(x,H)] + H[q^*(H|x)],    (3)

or if we care only about optimizing $p$,

    \log p(x) \geq E_{q^*(H|x)}[\log p(x,H)].    (4)
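
As a sanity check on the decomposition of Eq. 2 and the bounds of Eqs. 3 and 4, the identity can be verified numerically on a small discrete model. The sketch below is illustrative only: the table sizes, the random joint distribution and the arbitrary approximate posterior `q` are assumptions made for the example, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete joint p(x, h) with 3 values of x and 4 values of h.
joint = rng.random((3, 4))
joint /= joint.sum()

x = 0                                  # the observed value of x
p_x = joint[x].sum()                   # marginal p(x)
posterior = joint[x] / p_x             # exact posterior p(h | x)

# An arbitrary approximate posterior q*(h | x).
q = rng.random(4)
q /= q.sum()

expected_log_joint = np.sum(q * np.log(joint[x]))   # E_q[log p(x, H)]
entropy = -np.sum(q * np.log(q))                     # H[q]
kl = np.sum(q * np.log(q / posterior))               # KL(q || p(H | x))

print("log p(x)           :", np.log(p_x))
print("E + H + KL (Eq. 2) :", expected_log_joint + entropy + kl)
print("bound of Eq. 3     :", expected_log_joint + entropy)
print("bound of Eq. 4     :", expected_log_joint)
```

The first two printed values agree, and the two bounds sit below $\log p(x)$; the gap of Eq. 3 is exactly the KL term, so it closes when `q` equals `posterior`.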

The idea of variational bounds as proxies for the log-likelihood is that
as far as optimizing $p$ is concerned, i.e., dropping the entropy term
which does not depend on $p$, the bound becomes tight when
$q^*(H|x) = p(H|x)$. This suggests that $q^*(H|x)$ should approximate
$p(H|x)$. Fixing $q^*(H|x) = p(H|x)$ and optimizing $p$ with $q$ fixed
is the EM algorithm. Here (and in general) this is not possible so we
consider variational methods in which $q^*(H|x)$ approximates but does
not reach $p(H|x)$. This variational bound has recently been used to
justify another biologically plausible update rule (Rezende and
Gerstner, 2014), which however relied on the REINFORCE algorithm
(Williams, 1992) rather than on inference to obtain credit assignment to
internal neurons.

We propose to decompose $q^*(H|x)$ in two components: parametric
initialization $q_0(H|x) = q(H|x)$ and iterative inference, implicitly
defining $q^*(H|x) = q_T(H|x)$ via a deterministic or stochastic update,
or transition operator

    q_t(H|x) = A(x) q_{t-1}(H|x).    (5)


The variational bound suggests that $A(x)$ should gradually bring
$q_t(H|x)$ closer to $p(H|x)$. At the same time, to make sure that a few
steps will be sufficient to approach $p(H|x)$, one may add a term in the
objective function to make $q_0(H|x)$ closer to $p(H|x)$, as well as to
encourage $p(x, h)$ to favor solutions $p(H|x)$ that can be easily
approximated by $q_t(H|x)$ even for small $t$.

For this purpose, consider as training objective a regularized
variational MAP-EM criterion (for a given $x$):

    J = \log p(x, h) + \alpha \log q(h|x),    (6)


where $h$ is a free variable (for each $x$) initialized from $q(H|x)$
and then iteratively updated to approximately maximize $J$. The total
objective function is just the average of $J$ over all examples after
having performed inference (the approximate maximization over $h$ for
each $x$). A reasonable variant would not just encourage $q = q_0$ to
generate $h$ (given $x$), but all the $q_t$’s for $t > 0$ as well.
Alternatively, the iterative inference could be performed by
stochastically increasing $J$, i.e., via a Markov chain which may
correspond to probabilistic inference with spiking neurons (Pecevski et
al., 2011). The corresponding variational MAP or variational MCMC
algorithm would be as in Algorithm 1. For the stochastic version one
would inject noise when updating $h$. Variational MCMC (de Freitas et
al., 2001) can be used to approximate the posterior, e.g., as in the
model from Salimans et al. (2014). However, a rejection step does not
look very biologically plausible (both for the need of returning to a
previous state and for the need to evaluate the joint likelihood, a
global quantity). On the other hand, a biased MCMC with no rejection
step, such as the stochastic gradient Langevin MCMC of Welling and Teh
(2011), can work very well in practice.


Algorithm 1 Variational MAP (or MCMC) SGD algorithm for gradually
improving the agreement between the values of the latent variables $h$
and the observed data $x$. $q(h|x)$ is a learned parametric
initialization for $h$, $p(h)$ is a parametric prior on the latent
variables, and $p(x|h)$ specifies how to generate $x$ given $h$.
Objective function $J$ is defined in Eq. 6. Learning rates $\delta$ and
$\epsilon$ respectively control the optimization of $h$ and of
parameters $\theta$ (of both $q$ and $p$).

    Initialize $h \sim q(h|x)$
    for t = 1 to T do
        $h \leftarrow h + \delta \frac{\partial J}{\partial h}$  (optional: add noise for MCMC)
    end for
    $\theta \leftarrow \theta + \epsilon \frac{\partial J}{\partial \theta}$
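
To make the structure of Algorithm 1 concrete, here is a minimal sketch on a toy linear-Gaussian model. Everything specific in it is an assumption chosen so that the gradients can be written by hand: a unit-variance Gaussian prior $p(h)$, a decoder $p(x|h)$ with mean `W @ h`, an encoder $q(h|x)$ with mean `V @ x`, and the hypothetical function names `objective` and `variational_map_step`. It is not the model used in the experiments.

```python
import numpy as np

def objective(x, h, W, V, alpha):
    """J of Eq. 6 for the toy model, up to additive constants."""
    log_p_x_given_h = -0.5 * np.sum((x - W @ h) ** 2)
    log_p_h = -0.5 * np.sum(h ** 2)
    log_q_h_given_x = -0.5 * np.sum((h - V @ x) ** 2)
    return log_p_x_given_h + log_p_h + alpha * log_q_h_given_x

def variational_map_step(x, W, V, alpha=1.0, delta=0.05, eps=0.01,
                         T=20, noise_std=0.0, rng=None):
    """One pass of Algorithm 1 for a single example x."""
    if rng is None:
        rng = np.random.default_rng()
    # Initialize h ~ q(h|x) (unit-variance Gaussian around the encoder mean).
    h = V @ x + rng.standard_normal(V.shape[0])
    j_start = objective(x, h, W, V, alpha)
    for _ in range(T):
        # dJ/dh for the toy model, written out by hand.
        grad_h = W.T @ (x - W @ h) - h - alpha * (h - V @ x)
        # noise_std > 0 gives the stochastic (MCMC-like) variant.
        h = h + delta * grad_h + noise_std * rng.standard_normal(h.shape)
    j_end = objective(x, h, W, V, alpha)
    # Parameter update theta <- theta + eps * dJ/dtheta, with h held fixed.
    W_new = W + eps * np.outer(x - W @ h, h)
    V_new = V + eps * alpha * np.outer(h - V @ x, x)
    return h, W_new, V_new, j_start, j_end

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((5, 3))   # decoder weights (x has 5 dims, h has 3)
V = 0.1 * rng.standard_normal((3, 5))   # encoder weights
x = rng.standard_normal(5)
h, W_new, V_new, j_start, j_end = variational_map_step(x, W, V, rng=rng)
print("J before inference:", j_start)
print("J after inference :", j_end)
```

With `noise_std=0.0` the inner loop is plain gradient ascent on $J$ with respect to $h$, so $J$ does not decrease for a sufficiently small `delta`; setting `noise_std` above zero corresponds to the optional noise injection of the MCMC variant.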


4. Training a Deep Generative Model

There is strong biological evidence of a distinct pattern of
connectivity between cortical areas that distinguishes between
“feedforward” and “feedback” connections (Douglas et al., 1989) at the
level of the microcircuit of cortex (i.e., feedforward and feedback
connections do not land in the same type of cells). Furthermore, the
feedforward connections form a directed acyclic graph with nodes (areas)
updated in a particular order, e.g., in the visual cortex (Felleman and
Essen, 1991). So consider Algorithm 1 with $h$ decomposed into multiple
layers, with the conditional independence structure of a directed
graphical model structured as a chain, both for $p$ (going down) and for
$q$ (going up):

    p(x, h) = p(x | h^{(1)}) \left( \prod_{k=1}^{M-1} p(h^{(k)} | h^{(k+1)}) \right) p(h^{(M)})

    q(h | x) = q(h^{(1)} | x) \prod_{k=1}^{M-1} q(h^{(k+1)} | h^{(k)}).    (7)

This clearly decouples the updates associated with each layer, for both
$h$ and $\theta$, making these updates “local” to the layer $k$, based
on “feedback” from layer $k-1$ and $k+1$. Nonetheless, thanks to the
iterative nature of the updates of $h$, all the layers are interacting
via both feedforward ($q(h^{(k)} | h^{(k-1)})$) and feedback
($p(h^{(k)} | h^{(k+1)})$) paths. Denoting $x = h^{(0)}$ to simplify
notation, the $h$ update would thus consist in moves of the form

    h^{(k)} \leftarrow h^{(k)} + \delta \frac{\partial}{\partial h^{(k)}} \Big(
        \log \big( p(h^{(k-1)} | h^{(k)}) \, p(h^{(k)} | h^{(k+1)}) \big)
        + \alpha \log \big( q(h^{(k)} | h^{(k-1)}) \, q(h^{(k+1)} | h^{(k)}) \big) \Big),    (8)

where $\alpha$ is as in Eq. 6. No back-propagation is needed for the
above derivatives when $h^{(k)}$ is on the left hand side of the
conditional probability bar. Sec. 6 deals with the right hand side case.
For the left hand side case, e.g., $p(h^{(k)} | h^{(k+1)})$ a
conditional Gaussian with mean $\mu$ and variance $\sigma^2$, the
gradient with respect to $h^{(k)}$ is simply
$\frac{\mu - h^{(k)}}{\sigma^2}$. Note that there is an interesting
interpretation of such a deep model: the layers above $h^{(k)}$ provide
a complex implicitly defined prior for $p(h^{(k)})$.

5. Alternative Interpretations as Denoising
Auto-Encoder

By inspection of Algorithm 1, one can observe that this algorithm trains
$p(x|h)$ and $q(h|x)$ to form complementary pairs of an auto-encoder
(since the input of one is the target of the other and vice-versa). Note
that from that point of view any of the two can act as encoder and the
other as decoder for it, depending on whether we start from $h$ or from
$x$. In the case of multiple latent layers, each pair of conditionals
$q(h^{(k+1)} | h^{(k)})$ and $p(h^{(k)} | h^{(k+1)})$ forms a symmetric
auto-encoder, i.e., either one can act as the encoder and the other as
the corresponding decoder, since they are trained with the same
$(h^{(k)}, h^{(k+1)})$ pairs (but with reversed roles of input and
target).

In addition, if noise is injected, e.g., in the form of the quantization
induced by a spike train, then the trained auto-encoders are actually
denoising auto-encoders, which means that both the encoders and decoders
are contractive: in the neighborhood of the observed $(x, h)$ pairs,
they map neighboring “corrupted” values to the “clean” $(x, h)$ values.


5.1. Joint Denoising Auto-Encoder with Latent
Variables

This suggests considering a special kind of “joint” denoising
auto-encoder which has the pair $(x, h)$ as “visible” variable, an
auto-encoder that implicitly estimates an underlying $p(x, h)$. The
transition operator^3 for that joint visible-latent denoising
auto-encoder is the following in the case of a single hidden layer:

    (\tilde{x}, \tilde{h}) \leftarrow corrupt(x, h)
    h \sim q(h | \tilde{x}),    x \sim p(x | \tilde{h}),    (9)

where the corruption may correspond to the stochastic quantization
induced by the neuron non-linearity and spiking process. In the case of
a middle layer $h^{(k)}$ in a deeper model, the transition operator must
account for the fact that $h^{(k)}$ can either be reconstructed from
above or from below, yielding, with probability say $\frac{1}{2}$,

    h^{(k)} \sim p(h^{(k)} | h^{(k+1)}),    (10)

and with one minus that probability,

    h^{(k)} \sim q(h^{(k)} | h^{(k-1)}).    (11)

Since this interpretation provides a different model, it also provides a
different way of generating samples. Especially for shallow models, we
have found that better samples could be obtained in this way, i.e.,
running the Markov chain with the above transition operator for a few
steps.

There might be a geometric interpretation for the improved quality of
the samples when they are obtained in this way, compared to the directed
generative model that was defined earlier. Denote $q^*(x)$ the empirical
distribution of the data, which defines a joint
$q^*(h, x) = q^*(x) q^*(h|x)$. Consider the likely situation where
$p(x, h)$ is not well matched to $q^*(h, x)$ because for example the
parametrization of $p(h)$ is not powerful enough to capture the complex
structure in the empirical distribution $q^*(h)$ obtained by mapping the
training data through the encoder and inference $q^*(h|x)$. Typically,
$q^*(x)$ would concentrate on a manifold and the encoder would not be
able to completely unfold it, so that $q^*(h)$ would contain complicated
structure with pockets or manifolds of high probability. If $p(h)$ is a
simple factorized model, then it will generate values of $h$ that do not
correspond well to those seen by the decoder $p(x|h)$ when it was
trained, and these out-of-manifold samples in $h$-space are likely to be
mapped to out-of-manifold samples in $x$-space. One solution to this
problem is to increase the capacity of $p(h)$ (e.g., by adding more
layers on top of $h$). Another is to make $q(h|x)$ more powerful (which
again can be achieved by increasing the depth of the model, but this
time by inserting additional layers below $h$). Now, there is a cheap
way of obtaining a very deep directed graphical model, by unfolding the
Markov chain of an MCMC-based generative model for a fixed number of
steps, i.e., considering each step of the Markov chain as an extra
“layer” in a deep directed generative model, with shared parameters
across these layers. As we have seen that there is such an
interpretation via the joint denoising auto-encoder over both latent and
visible, this idea can be immediately applied. We know that each step of
the Markov chain operator moves its input distribution closer to the
stationary distribution of the chain. So if we start from samples from a
very broad (say factorized) prior $p(h)$ and we iteratively
encode/decode them (injecting noise appropriately as during training) by
successively sampling from $p(x|h)$ and then from $q(h|x)$, the
resulting $h$ samples should end up looking more like those seen during
training (i.e., from $q^*(h)$).

^3 See Theorem 1 from Bengio et al. (2013) for the generative
interpretation of denoising auto-encoders: it basically states that one
can sample from the model implicitly estimated by a denoising
auto-encoder by simply alternating noise injection (corruption),
encoding and decoding, these forming each step of a generative Markov
chain.

5.2. Latent Variables as Corruption

There is another interpretation of the training procedure, also as a
denoising auto-encoder, which has the advantage of producing a
generative procedure that is the same as the inference procedure except
for $x$ being unclamped.

We return again to the generative interpretation of the denoising
criterion for auto-encoders, but this time we consider the
non-parametric process $q^*(h|x)$ as a kind of corruption of $x$ that
yields the $h$ used as input for reconstructing the observed $x$ via
$p(x|h)$. Under that interpretation, a valid generative procedure
consists at each step in first performing inference, i.e., sampling $h$
from $q^*(h|x)$, and second sampling from $p(x|h)$. Iterating these
steps generates $x$’s according to the Markov chain whose stationary
distribution is an estimator of the data generating distribution that
produced the training $x$’s (Bengio et al., 2013). This view does not
care about how $q^*(h|x)$ is constructed, but it tells us that if
$p(x|h)$ is trained to maximize reconstruction probability, then we can
sample in this way from the implicitly estimated model.

We have also found good results using this procedure (Algorithm 2
below), and from the point of view of biological plausibility, it would
make more sense that “generating” should involve the same operations as
“inference”, except for the input being observed or not.
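
The alternation described in this subsection, sampling $h$ from the inference process and then $x$ from $p(x|h)$, can be sketched as a short Markov chain. The code below is a hedged illustration, not the procedure used in the experiments: the Gaussian noise model, the function name `sample_chain`, the noise levels and the toy linear encoder and decoder are all assumptions, and the single parametric encoding step only stands in for the full iterative inference $q^*(h|x)$.

```python
import numpy as np

def sample_chain(x0, encode_mean, decode_mean, n_steps,
                 sigma_h=0.1, sigma_x=0.1, rng=None):
    """Alternate h ~ q(h|x) and x ~ p(x|h), returning the visited x's."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # "Inference" step: noisy encoding of the current x.
        mean_h = encode_mean(x)
        h = mean_h + sigma_h * rng.standard_normal(mean_h.shape)
        # "Generation" step: noisy decoding of h, with x unclamped.
        mean_x = decode_mean(h)
        x = mean_x + sigma_x * rng.standard_normal(mean_x.shape)
        samples.append(x)
    return np.stack(samples)

rng = np.random.default_rng(0)
# Toy encoder/decoder pair: A has orthonormal columns, so the loop is stable.
A, _ = np.linalg.qr(rng.standard_normal((4, 2)))
samples = sample_chain(
    x0=rng.standard_normal(4),
    encode_mean=lambda x: 0.9 * (A.T @ x),
    decode_mean=lambda h: A @ h,
    n_steps=50,
    rng=rng,
)
print(samples.shape)  # (50, 4)
```

The same loop performs inference when $x$ is held at an observed value and generation when $x$ is resampled, which is the symmetry the text argues for.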


6. Targetprop instead of Backprop 

In Algorithm 1 and the related stochastic variants, Eq. 8 suggests that
back-propagation (through one layer) is still needed when $h^{(k)}$ is
on the right hand side of the conditional probability bar, e.g., to
compute $\frac{\partial p(h^{(k-1)} | h^{(k)})}{\partial h^{(k)}}$. Such
a gradient is also the basic building block in back-propagation for
supervised learning: we need to back-prop through one layer, e.g. to
make $h^{(k)}$ more “compatible” with $h^{(k-1)}$. This provides a kind
of error signal, which in the case of unsupervised learning comes from
the sensors, and in the case of supervised learning, comes from the
layer holding the observed “target”.


Based on recent theoretical results on denoising auto-encoders, we
propose the following estimator (up to a scaling constant) of the
required gradient, which is related to previous work on “target
propagation” (Bengio, 2014; Lee et al., 2014) or targetprop for short.
To make notation simpler, we focus below on the case of two layers $h$
and $x$ with “encoder” $q(h|x)$ and “decoder” $p(x|h)$, and we want to
estimate $\frac{\partial \log p(x|h)}{\partial h}$. We start with the
special case where $p(x|h)$ is a Gaussian with mean $g(h)$ and $q(h|x)$
is Gaussian with mean $f(x)$, i.e., $f$ and $g$ are the deterministic
components of the encoder and decoder respectively. The proposed
estimator is then

    \Delta h = \frac{f(x) - f(g(h))}{\sigma_h^2},    (12)

where $\sigma_h^2$ is the variance of the noise injected in $q(h|x)$.

Let us now justify this estimator. Theorem 2 by Alain and Bengio (2013)
states that in a denoising auto-encoder with reconstruction function
$r(x) = decode(encode(x))$, a well-trained auto-encoder estimates the
log-score via the difference between its reconstruction and its input:

    \frac{r(x) - x}{\sigma^2} \rightarrow \frac{\partial \log p(x)}{\partial x},

where $\sigma^2$ is the variance of injected noise, and $p(x)$ is the
implicitly estimated density. We are now going to consider two denoising
auto-encoders and apply this theorem to them. First, we note that the
gradient $\frac{\partial \log p(x|h)}{\partial h}$ that we wish to
estimate can be decomposed as follows:

    \frac{\partial \log p(x|h)}{\partial h} = \frac{\partial \log p(x,h)}{\partial h} - \frac{\partial \log p(h)}{\partial h}.


Hence it is enough to estimate $\frac{\partial \log p(x,h)}{\partial h}$
as well as $\frac{\partial \log p(h)}{\partial h}$. The second one can
be estimated by considering the auto-encoder which estimates $p(h)$
implicitly and for which $g$ is the encoder (with $g(h)$ the “code” for
$h$) and $f$ is the decoder (with $f(g(h))$ the “reconstruction” of
$h$). Hence we have that $\frac{f(g(h)) - h}{\sigma_h^2}$ is an
estimator of $\frac{\partial \log p(h)}{\partial h}$.


The other gradient can be estimated by considering the joint denoising
auto-encoder over $(x, h)$ introduced in the previous section. The
(noise-free) reconstruction function for that auto-encoder is

    r(x, h) = (g(h), f(x)).

Hence $\frac{f(x) - h}{\sigma_h^2}$ is an estimator of
$\frac{\partial \log p(x,h)}{\partial h}$. Combining the two estimators,
we get

    \frac{f(x) - h}{\sigma_h^2} - \frac{f(g(h)) - h}{\sigma_h^2} = \frac{f(x) - f(g(h))}{\sigma_h^2},

which corresponds to Eq. 12.
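
The estimator of Eq. 12 can be checked in one special case where the exact gradient is available in closed form. The sketch below assumes, purely for illustration, that the decoder mean $g$ is an orthogonal linear map, that the encoder mean $f$ is its exact inverse, and that $p(x|h)$ is a unit-variance Gaussian; these choices and all variable names are hypothetical, and the general nonlinear case is not covered by this check.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3

# Random orthogonal matrix Q (assumption: g is linear and orthogonal).
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))

def g(h):
    """Decoder mean: maps h to x."""
    return Q @ h

def f(x):
    """Encoder mean: exact inverse of g, since Q is orthogonal."""
    return Q.T @ x

sigma_h = 0.5
x = rng.standard_normal(dim)
h = rng.standard_normal(dim)

# Targetprop estimator of Eq. 12: only forward passes through f and g.
delta_h = (f(x) - f(g(h))) / sigma_h ** 2

# Exact gradient of log p(x|h) = -0.5 * ||x - g(h)||^2 + const w.r.t. h.
true_grad = Q.T @ (x - g(h))

cosine = delta_h @ true_grad / (np.linalg.norm(delta_h) * np.linalg.norm(true_grad))
print("cosine similarity:", cosine)
```

In this special case the estimator equals the exact gradient up to the scaling constant $1/\sigma_h^2$, consistent with the statement that Eq. 12 is an estimator up to scale.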



Figure 2. The optimal $h$ for maximizing $p(x|h)$ is $\tilde{h}$ s.t.
$g(\tilde{h}) = x$. Since the encoder $f$ and decoder $g$ are
approximate inverses of each other, their composition makes a small move
$\Delta x$. Eq. 12 is obtained by assuming that by considering an
$\tilde{x}$ at $x - \Delta x$ and applying $f \circ g$, one would
approximately recover $x$, which should be true if the changes are small
and the functions smooth (see Lee and Bengio (2014) for a detailed
derivation).

Another way to obtain the same formula from a geometric perspective is
illustrated in Figure 2. It was introduced in Lee and Bengio (2014) in
the context of a backprop-free algorithm for training a denoising
auto-encoder.

7. Related Work 

An important inspiration for the proposed framework is the biological
implementation of back-propagation proposed by Hinton (2007). In that
talk, Hinton suggests that STDP corresponds to a gradient update step
with the gradient on the voltage potential corresponding to its temporal
derivative. To obtain the supervised back-propagation update in the
proposed scenario would require symmetric weights and synchronization of
the computations in terms of feedforward and feedback phases.

Our proposal introduces a novel machine learning interpretation that
also matches the STDP behavior, based on a variational EM framework,
allowing us to obtain a more biologically plausible mechanism for deep
generative unsupervised learning, avoiding the need for symmetric
weights, and introducing a novel method to obtain neural updates that
approximately propagate gradients and move towards better overall
configurations of neural activity (with difference target-prop). There
is also an interesting connection with an earlier proposal for a more
biologically plausible implementation of supervised back-propagation
(Xie and Seung, 2003) which also relies on iterative inference (a
deterministic relaxation in that case), but needs symmetric weights.

Another important inspiration is Predictive Sparse Decomposition
(PSD) (Kavukcuoglu et al., 2008). PSD is a special case of
Algorithm 1 when there is only one layer and the encoder q(h|x),
decoder p(x|h), and prior p(h) have a specific form which makes
p(x, h) a sparse coding model and q(h|x) a fast parametric
approximation of the correct posterior. Our proposal extends PSD
by providing a justification for the training criterion as a
variational bound, by generalizing to multiple layers of latent variables,
and by providing associated generative procedures.

The combination of a parametric approximate inference machine
(the encoder) and a generative decoder (each with possibly several
layers of latent variables) is an old theme that started with the
Wake-Sleep algorithm (Hinton et al., 1995) and which finds very
interesting instantiations in the variational auto-encoder (Kingma
and Welling, 2014; Kingma et al., 2014) and the reweighted
wake-sleep algorithm (Bornschein and Bengio, 2014). An important
difference with the approach proposed here is that here we
avoid back-propagation thanks to an inference step that
approximates the posterior. In this spirit, see the recent work introducing
MCMC inference for the variational auto-encoder (Salimans et al.,
2014).

The proposal made here also owes a lot to the idea of target
propagation introduced in Bengio (2014); Lee et al. (2014), to which
it adds the idea that in order to find a target that is consistent with
both the input and the final output target, it makes sense to
perform iterative inference, reconciling the bottom-up and top-down
pressures. The weight transport problem (the weight symmetry
constraint) was also addressed for the supervised case using
feedback alignment (Lillicrap et al., 2014): even if the feedback
weights do not exactly match the feedforward weights, the
latter learn to align to the former and “back-propagation” (with the
wrong feedback weights) still works.

The targetprop formula avoiding back-propagation through one
layer is actually the same as proposed by Lee and Bengio (2014)
for backprop-free auto-encoders. What has been added here is a
justification of this specific formula based on the denoising
auto-encoder theorem from Alain and Bengio (2013), and the empirical
validation of its ability to climb the joint likelihood for variational
inference.


Compared to previous work on auto-encoders, and in particular
their generative interpretation (Bengio et al., 2013; 2014), this
paper for the first time introduces latent variables without requiring
back-propagation for training.


8. Experimental Validation 

Figure 3 shows generated samples obtained after training on
MNIST with Algorithm 2 (derived from the considerations of the
preceding sections, including Sec. 6). The network has two hidden
layers, h_1 with 1000 softplus units and h_2 with 100 sigmoid units
(which can be considered biologically plausible (Glorot et al., 2011)).
We trained for 20 epochs, with minibatches of size 100 to speed
up computation using GPUs. Results can be reproduced from
code at http://goo.gl/hoQqR5. Using the Parzen density
estimator previously used for that data, we obtain an estimated
log-likelihood LL=236 (using a standard deviation of 0.2
for the Parzen density estimator, chosen with the validation set),
which is about the same as or better than what was obtained for
contractive auto-encoders (Rifai et al., 2011) (LL=121), deeper
generative stochastic networks (Bengio et al., 2014) (LL=214) and
generative adversarial networks (Goodfellow et al., 2014) (LL=225).
In accordance with Algorithm 2, the variances of the conditional
densities are 1, and the top-level prior is ignored during most of
training (as if it were a very broad, uniform prior) and only set to
the Gaussian by the end of training, before generation (by setting
the parameters of p(h_2) to the empirical mean and variance of
the projected training examples at the top level). Figure 4 shows
that the targetprop updates (instead of the gradient updates) allow
the inference process to indeed smoothly increase the joint
likelihood. Note that if we sample using the directed graphical model
p(x|h)p(h), the samples are not as good and LL=126, suggesting,
as discussed in Sec. 5, that additional inference and encode/decode
iterations move h towards values that are closer to q*(h) (the
empirical distribution of inferred states from training examples).

The experiment illustrated in Figure 5 shows that the proposed
inference mechanism can be used to fill in missing values with a
trained model. The model is the same one that was trained using
Algorithm 2 (with samples shown in Figure 3). 20 iterations
of encode/decode as described below were performed, with a call
to INFERENCE (to maximize p(x, h) over h) for each step, with
a slight modification. Instead of using f_1(x) - f_1(g_1(h)) to
account for the pressure of x upon h (towards maximizing p(x|h)),
we used f_1(x_v, g_m(h)) - f(g(h)), where x_v is the part of x that
is visible (clamped) while g_m(h) is the part of the output of g(h)
that concerns the missing (corrupted) inputs. This formula was
derived from the same consideration as for Eq. 12, but where the
quantity of interest is $\frac{\partial \log p(x_v|h)}{\partial h}$
rather than $\frac{\partial \log p(x|h)}{\partial h}$, and we
consider that the reconstruction of h, given x_v, fills in the missing
inputs (x_m) from g_m(h).

Figure 3. MNIST samples generated by GENERATE from
Algorithm 2 after training with TRAIN.

Figure 4. Increase of log p(x, h) over 20 iterations of the
INFERENCE procedure of Algorithm 2, showing that the targetprop
updates increase the joint likelihood. The solid red line shows the
average and the standard error over the full test set containing
10,000 digits. Dashed lines show log p(x, h) for individual datapoints.

Figure 5. Examples of filling-in (in-painting) missing (initially corrupted) parts of an image. Left: original MNIST test examples.
Middle: initial state of the inference, with half of the pixels randomly sampled (with a different corruption pattern in each row of the
figure). Right: reconstructions using a variant of the INFERENCE procedure of Algorithm 2 for the case when some inputs are clamped.
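
The clamped variant of the update can be sketched in Python as follows.
This is a minimal single-layer illustration under stated assumptions:
the encoder f and decoder g are hypothetical linear maps that are exact
inverses, the step size 0.1 and the mask are arbitrary, and the helper
name clamped_pressure is introduced here for illustration only. The
experiments above call the full two-layer INFERENCE procedure at each of
the 20 steps instead.

```python
import numpy as np

def clamped_pressure(f, g, x, visible, h):
    """Return f(x_filled) - f(g(h)) and x_filled, where x_filled keeps the
    visible entries of x and takes the missing ones from g(h)."""
    recon = g(h)
    x_filled = np.where(visible, x, recon)
    return f(x_filled) - f(recon), x_filled

# Hypothetical toy encoder/decoder: linear maps that are exact inverses.
rng = np.random.default_rng(0)
dim = 6
W = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))
W_inv = np.linalg.inv(W)
f = lambda x: W @ x
g = lambda h: W_inv @ h

x = rng.standard_normal(dim)
visible = np.array([True, False, True, False, True, False])

# Start from a corrupted input: missing entries are replaced by noise.
x_init = np.where(visible, x, rng.standard_normal(dim))
h = f(x_init)
for _ in range(20):
    pressure, x_filled = clamped_pressure(f, g, x, visible, h)
    h = h + 0.1 * pressure
```

When every input is visible the pressure reduces to f(x) - f(g(h)), and
when none is visible it vanishes, so the update only pulls h towards
states whose reconstruction agrees with the clamped part of x.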

9. Future Work and Conclusion 

We consider this paper an exploratory step towards explaining
a central aspect of the brain’s learning algorithm: credit
assignment through many layers. Of the non-plausible elements of
back-propagation described in the introduction, the proposed
approach addresses all except the 5th. As argued by Bengio (2014);
Lee et al. (2014), departing from back-propagation could be
useful not just for biological plausibility but from a machine learning
point of view as well: by working on the “targets” for the
intermediate layers, we may avoid the kind of reliance on smoothness
and derivatives that characterizes back-propagation, as these
techniques can in principle work even with highly non-linear
transformations for which gradients are often near 0, e.g., with stochastic
binary units (Lee et al., 2014). Besides the connection between
STDP and variational EM, an important contribution of this
paper is to show that the “targetprop” update which estimates the
gradient through one layer can be used for inference, yielding
systematic improvements in the joint likelihood and making it
possible to learn a good generative model. Another interesting
contribution is that the variational EM updates, with noise added, can
also be interpreted as training a denoising auto-encoder over both
visible and latent variables, and that iterating from the associated
Markov chain yields better samples than those obtained from the
directed graphical model estimated by variational EM.


Many directions need to be investigated to follow up on the work
reported here. An important element of neural circuitry is the
strong presence of lateral connections between nearby neurons
in the same area. In the proposed framework, an obvious place
for such lateral connections is to implement the prior on the joint
distribution between nearby neurons, something we have not
explored in our experiments. For example, Garrigues and
Olshausen (2008) have discussed neural implementations of the
inference involved in sparse coding based on the lateral
connections.


Although we have found that “injecting noise” helped train a
better model, more theoretical work needs to be done to explore
this replacement of a MAP-based inference by an MCMC-like
inference, which should help determine how and how much of
this noise should be injected.

Whereas this paper focused on unsupervised learning, these ideas
could be applied to supervised learning and reinforcement
learning as well. For reinforcement learning, an important role of the
proposed algorithms is to learn to predict rewards, although a
more challenging question is how the MCMC part could be used
to simulate future events. For both supervised learning and
reinforcement learning, we would probably want to add a mechanism
that would give more weight to minimizing prediction (or
reconstruction) error for some of the observed signals (e.g. y is more
important to predict than x).

Finally, a lot needs to be done to connect in more detail the
proposals made here with biology, including neural implementation
using spikes with Poisson rates as the source of signal
quantization and randomness, taking into account the constraints on the
sign of the weights depending on whether the pre-synaptic
neuron is inhibitory or excitatory, etc. In addition, although the
operations proposed here are backprop-free, they may still require
some kind of synchronization (or control mechanism) and
specific connectivity to be implemented in brains.

Algorithm 2 Inference, training and generation procedures
used in Experiment 1. The algorithm can naturally be
extended to more layers. f_i() is the feedforward map from
layer i - 1 to layer i and g_i() is the feedback map from layer
i to layer i - 1, with x = h_0 being layer 0.

Define INFERENCE(x, N=15, δ=0.1, α=0.001):
    Feedforward pass: h_1 ← f_1(x), h_2 ← f_2(h_1)
    for t = 1 to N do
        h_2 ← h_2 + δ(f_2(h_1) - f_2(g_2(h_2)))
        h_1 ← h_1 + δ(f_1(x) - f_1(g_1(h_1))) + α(g_2(h_2) - h_1)
    end for
    Return h_1, h_2

Define TRAIN():
    for x in training set do
        h_1, h_2 ← INFERENCE(x) (i.e. E-part of EM)
        update each layer using local targets (M-part of EM):
            Θ ← Θ - ε ∂/∂Θ (g_i(h̃_i) - h_{i-1})²
            Θ ← Θ - ε ∂/∂Θ (f_i(h̃_{i-1}) - h_i)²
        where h̃_i is a Gaussian-corrupted version of h_i. For
        the top sigmoid layer h_2 we average 3 samples from a
        Bernoulli distribution with p(h̃_2 = 1) = h_2 to obtain
        a spike-like corruption.
    end for
    Compute the mean and variance of h_2 using the training
    set. Multiply the variances by 4. Define p(h_2) as
    sampling from this Gaussian.

Define GENERATE():
    Sample h_2 from p(h_2)
    Assign h_1 ← g_2(h_2) and x ← g_1(h_1)
    for t = 1 to 3 do
        h_1, h_2 ← INFERENCE(x, α = 0.3)
        x ← g_1(h_1)
    end for
    Return x
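
The INFERENCE and GENERATE procedures of Algorithm 2 can be sketched in
Python with NumPy as follows. This is a hedged illustration, not the
released code: the weights are random and untrained, the layer sizes in
the usage lines are small stand-ins for the 1000 and 100 units used in
the experiments, and the output nonlinearities chosen for g_1 (sigmoid)
and g_2 (softplus), along with all function and parameter names, are
assumptions made for the example.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(n_x, n_h1, n_h2, seed=0):
    """Random, untrained weights (illustration only)."""
    rng = np.random.default_rng(seed)
    w = lambda n_in, n_out: 0.01 * rng.standard_normal((n_out, n_in))
    return {"F1": w(n_x, n_h1), "F2": w(n_h1, n_h2),
            "G1": w(n_h1, n_x), "G2": w(n_h2, n_h1)}

# Feedforward maps f_i and feedback maps g_i (nonlinearities for g_i assumed).
def f1(p, x):  return softplus(p["F1"] @ x)
def f2(p, h1): return sigmoid(p["F2"] @ h1)
def g1(p, h1): return sigmoid(p["G1"] @ h1)
def g2(p, h2): return softplus(p["G2"] @ h2)

def inference(p, x, n_steps=15, delta=0.1, alpha=0.001):
    """Targetprop-based iterative inference over (h1, h2) given x."""
    h1 = f1(p, x)
    h2 = f2(p, h1)
    for _ in range(n_steps):
        h2 = h2 + delta * (f2(p, h1) - f2(p, g2(p, h2)))
        h1 = (h1 + delta * (f1(p, x) - f1(p, g1(p, h1)))
                 + alpha * (g2(p, h2) - h1))
    return h1, h2

def generate(p, h2_mean, h2_std, rng, n_rounds=3):
    """Sample h2 from the Gaussian prior, decode, then refine by inference."""
    h2 = h2_mean + h2_std * rng.standard_normal(h2_mean.shape)
    h1 = g2(p, h2)
    x = g1(p, h1)
    for _ in range(n_rounds):
        h1, h2 = inference(p, x, alpha=0.3)
        x = g1(p, h1)
    return x

# Usage on stand-in data (random vectors in [0, 1], not MNIST).
p = init_params(n_x=20, n_h1=30, n_h2=10)
rng = np.random.default_rng(1)
data = rng.random((50, 20))
H2 = np.stack([inference(p, x)[1] for x in data])
# Multiplying the variances by 4 doubles the standard deviations.
h2_mean, h2_std = H2.mean(axis=0), 2.0 * H2.std(axis=0)
sample = generate(p, h2_mean, h2_std, rng)
```

The TRAIN procedure is omitted from the sketch; with trained weights the
same two functions would be used unchanged, since inference and
generation only require forward evaluations of f_i and g_i.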





